1. Data Set Preparation

	1.1 Create the folder /root/TrainingOnHDP/dataset/spark in your sandbox

	1.2 Upload all data files (including subfolders) into /root/TrainingOnHDP/dataset/spark in your sandbox

	1.3 Log in to the sandbox and run the following commands:

		cat /root/TrainingOnHDP/dataset/spark/movielens/ml-latest/ratings.csv.bz2.* > /root/TrainingOnHDP/dataset/spark/movielens/ml-latest/ratings.csv.bz2

		hadoop fs -mkdir -p /root/labs/datasets
		hadoop fs -put /root/TrainingOnHDP/dataset/spark/* /root/labs/datasets
		hadoop fs -put -f /root/TrainingOnHDP/dataset/spark/movielens/ml-latest/ratings.csv.bz2 /root/labs/datasets/movielens/ml-latest
		hadoop fs -chmod -R 777 /root/labs/datasets
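
	Assuming the ratings archive parts were produced by a byte-wise split(1) of the compressed file, plain concatenation restores it exactly. A self-contained sketch of that split/reassemble cycle using throwaway data (file names here are illustrative, not the lab dataset):

```shell
# simulate the split/reassemble cycle with throwaway data
cd "$(mktemp -d)"
seq 1 500 > ratings.csv
bzip2 -k ratings.csv                           # produces ratings.csv.bz2
split -b 1k ratings.csv.bz2 ratings.csv.bz2.   # parts: ratings.csv.bz2.aa, .ab, ...
rm ratings.csv.bz2
cat ratings.csv.bz2.* > ratings.csv.bz2        # byte-wise reassembly
bzcat ratings.csv.bz2 | cmp -s - ratings.csv && echo "archive intact"
```

	Because split works on raw bytes, no re-compression step is needed when joining the parts back together.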
		
2. Creating Example Hive Tables using Spark SQL		

	2.1 Run the following command to enter the Spark SQL shell (make sure the Zeppelin service is stopped first):
		
		spark-sql

	2.2 Run the following statements to create several Hive tables (note: tables created with a non-Hive-supported SerDe such as com.databricks.spark.csv will not be queryable from Hive itself):
	
	
		CREATE TABLE movies(movieId INT, title STRING, genres STRING) USING com.databricks.spark.csv OPTIONS (path "/root/labs/datasets/movielens/ml-latest/movies.csv.bz2", header "true");
		CREATE TABLE movie_ratings(userId INT, movieId INT, rating FLOAT, timestamp INT) USING com.databricks.spark.csv OPTIONS (path "/root/labs/datasets/movielens/ml-latest/ratings.csv.bz2", header "true");
		CREATE TABLE movie_tags(userId INT, movieId INT, tags STRING, timestamp INT) USING com.databricks.spark.csv OPTIONS (path "/root/labs/datasets/movielens/ml-latest/tags.csv.bz2", header "true");
		CREATE TABLE movie_links(movieId INT, imdbId INT, tmdbId INT) USING com.databricks.spark.csv OPTIONS (path "/root/labs/datasets/movielens/ml-latest/links.csv.bz2", header "true");
		CREATE TABLE dating_genders(userId INT, gender STRING) USING com.databricks.spark.csv OPTIONS (path "/root/labs/datasets/dating/genders.csv.bz2", header "true");
		CREATE TABLE dating_ratings(fromUserId INT, toUserId INT, rating INT) USING com.databricks.spark.csv OPTIONS (path "/root/labs/datasets/dating/ratings.csv.bz2", header "true");
	
3. Installation and Configuration of Cassandra

	3.1 Create one file named datastax.repo under /etc/yum.repos.d in your Sandbox
		
	3.2 Add the following to the file above:
		
		[datastax]
		name = DataStax Repo for Apache Cassandra
		baseurl = http://rpm.datastax.com/community
		enabled = 1
		gpgcheck = 0
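
	The same file can be written non-interactively with a heredoc. The sketch below writes to /tmp so it is safe to try anywhere; copy the result to /etc/yum.repos.d/datastax.repo as root:

```shell
# write the repo definition in one shot; copy into /etc/yum.repos.d/ as root afterwards
cat > /tmp/datastax.repo <<'EOF'
[datastax]
name = DataStax Repo for Apache Cassandra
baseurl = http://rpm.datastax.com/community
enabled = 1
gpgcheck = 0
EOF
```

	After copying it into place, `yum makecache` will confirm that yum sees the new repository.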

	3.3 Run the following command in your sandbox to install Cassandra:
			
		yum install dsc20
			
	3.4 Start Cassandra

		service cassandra start

	3.5	Check Cassandra Service Status

		service cassandra status	
		
	3.6 Enter the Cassandra Command Line
			
		cqlsh
		
	3.7 Run the following commands in the Cassandra shell to create the keyspace and table, and load the sample data:
		
		DROP KEYSPACE IF EXISTS sparklabs;
		CREATE KEYSPACE sparklabs WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};
		USE sparklabs;
		DROP TABLE IF EXISTS item_ratings;
		CREATE TABLE item_ratings(userid int, itemid int, rating int, timestamp bigint, geocity text, PRIMARY KEY(userid, itemid));
		COPY item_ratings (userid, itemid, rating, "timestamp", geocity) FROM '/root/TrainingOnHDP/dataset/graph/item-ratings-geo.csv';
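
	COPY expects one CSV line per row, with fields in the listed column order. The sample row below is hypothetical (the real file's values will differ); a line of item-ratings-geo.csv can be sanity-checked the same way:

```shell
# hypothetical row matching the item_ratings schema: userid,itemid,rating,timestamp,geocity
echo '42,1001,5,1469566272000,Toronto' > /tmp/item-ratings-sample.csv
awk -F, '{ print NF }' /tmp/item-ratings-sample.csv    # expect 5 fields per row
```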
		
	3.8 Change rpc_address to 0.0.0.0 in the file cassandra.yaml under /etc/cassandra/conf in your sandbox, so clients can connect from outside the sandbox
	
	3.9 Run the following command to restart Cassandra:
	
		service cassandra stop
		service cassandra start
	
	3.10 Check that the Cassandra CQL port (9042) is reachable:
		
		nc -z localhost 9042
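
	If nc is not installed, bash's built-in /dev/tcp pseudo-device can perform the same check. A bash-specific sketch:

```shell
# returns success iff a TCP connection to HOST:PORT can be opened (bash only)
port_open() {
    (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}
port_open localhost 9042 && echo "Cassandra port open" || echo "Cassandra port closed"
```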

4. Installation and Configuration of Redis

	4.1 Run the following commands in your Sandbox to install Redis:

		rpm -Uvh http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
		rpm -Uvh http://rpms.famillecollet.com/enterprise/remi-release-6.rpm
		yum --enablerepo=remi,remi-test install redis

	4.2 Run the following command to start, stop, restart, or check the status of Redis:

		service redis start/stop/restart/status

	4.3 Run the following command in your sandbox to enter Redis Shell:	
		
		redis-cli
		
	4.4 Run the following command in the Redis shell to retrieve a value ('topk' is the key; the reply is the value stored under that key):
	
		get 'topk'
		
		
5. Installation and Configuration of Elasticsearch

	5.1 Stop the Apache Atlas service (it internally launches an old version of Elasticsearch). The following command shows which process, if any, is still listening on port 9200:
	
		netstat -tulpn
	
	5.2 Installation URL
		
		https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.1.1.tar.gz
	
	5.3 Upload elasticsearch-6.1.1.tar.gz to /root/TrainingOnHDP/ on HDP sandbox

	5.4 Log in to the sandbox web shell at localhost:4200 and unpack the file on the HDP sandbox:

		cd /root/TrainingOnHDP/
		tar xvzf elasticsearch-6.1.1.tar.gz
		
	5.5 Open elasticsearch.yml at /root/TrainingOnHDP/elasticsearch-6.1.1/config and make sure the HTTP port is set as follows (the default, 9200, is free once Atlas is stopped):

		http.port: 9200
			
	5.6 Start Elasticsearch

		Elasticsearch refuses to run as root, so create a dedicated user, make sure it can read the installation directory (in the sandbox this may also require chmod 755 /root), and start Elasticsearch as that user:

		useradd elastic
		passwd elastic
		chown -R elastic:elastic /root/TrainingOnHDP/elasticsearch-6.1.1
		su elastic
			
		/root/TrainingOnHDP/elasticsearch-6.1.1/bin/elasticsearch		
	
	
6. Installation and Configuration of Kibana

	6.1 Installation URL
		
		https://artifacts.elastic.co/downloads/kibana/kibana-6.1.1-linux-x86_64.tar.gz
	
	6.2 Upload kibana-6.1.1-linux-x86_64.tar.gz to /root/TrainingOnHDP/ on HDP sandbox

	6.3 Log in to the sandbox web shell at localhost:4200 and unpack the file on the HDP sandbox:

		cd /root/TrainingOnHDP/
		tar xvzf kibana-6.1.1-linux-x86_64.tar.gz
			
	6.4 Open kibana.yml at /root/TrainingOnHDP/kibana-6.1.1-linux-x86_64/config and make the following changes (the port defaults to 5601):

		server.port: 8744
		server.host: "0.0.0.0"
			
	6.5 Start Kibana
		
		/root/TrainingOnHDP/kibana-6.1.1-linux-x86_64/bin/kibana
			
		
7. Installation and Configuration of Logstash

	7.1 Create logstash.repo under /etc/yum.repos.d in your sandbox for Logstash, and add the following to this file:
	
		[logstash-2.2]
		name=logstash repository for 2.2 packages
		baseurl=http://packages.elasticsearch.org/logstash/2.2/centos
		gpgcheck=1
		gpgkey=http://packages.elasticsearch.org/GPG-KEY-elasticsearch
		enabled=1	
		
	7.2 Install logstash with this command:

		yum -y install logstash
		
	7.3	Configure Logstash
	
		Create a configuration file called 02-beats-input.conf under /etc/logstash/conf.d and add the following to this file to set up our "filebeat" input:
		
		input {
			beats {
				port => 5044
				ssl => false
			}
		}
		
		Create a configuration file called 30-elasticsearch-output.conf under /etc/logstash/conf.d and add the following to this file to set up our Elasticsearch output:
		
		output {
			elasticsearch {
				hosts => ["sandbox.hortonworks.com:9200"]
				sniffing => true
				manage_template => false
				index => "%{[@metadata][beat]}-%{+YYYY.MM.dd}"
				document_type => "%{[@metadata][type]}"
			}
		}	
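
		The index option combines Beat metadata with a Joda-style date: a filebeat event ingested on 2016-03-05 is written to the index filebeat-2016.03.05. The equivalent name can be sketched in the shell (GNU date assumed):

```shell
# reproduce "%{[@metadata][beat]}-%{+YYYY.MM.dd}" for a filebeat event on a given day
beat=filebeat
day=$(date -u -d '2016-03-05' +%Y.%m.%d)   # GNU date; Joda +YYYY.MM.dd equivalent
echo "${beat}-${day}"                      # filebeat-2016.03.05
```

		One index per Beat per day keeps retention simple: old days can be dropped by deleting whole indices.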
		
		Test your Logstash configuration with this command:
		
		service logstash configtest
		
	
			
	7.4 Run the following commands to start, stop, or check the status of Logstash:

		service logstash start
		service logstash stop
		service logstash status		
		
8. Load Kibana Dashboards

	8.1 Run the following command to download the sample dashboards:
	
		cd ~
		curl -L -O https://download.elastic.co/beats/dashboards/beats-dashboards-1.1.0.zip
		
	8.2 Extract the contents of the archive:

		unzip beats-dashboards-1.1.0.zip
		
	8.3 Load the sample dashboards, visualizations and Beats index patterns into Elasticsearch with these commands:

		cd beats-dashboards-1.1.0
		./load.sh

		
9. Load Filebeat Index Template in Elasticsearch

	9.1 Download the Filebeat index template:
	
		cd ~
		curl -O https://gist.githubusercontent.com/thisismitch/3429023e8438cc25b86c/raw/d8c479e2a1adcea8b1fe86570e42abab0f10f364/filebeat-index-template.json
		
	9.2 Load the template with this command:

		curl -XPUT 'http://localhost:9200/_template/filebeat?pretty' -d@filebeat-index-template.json

		
10. Installation and Configuration of Filebeat

	10.1 Log in to your sandbox as root and run the following command to import the Elasticsearch public GPG key into rpm:
	
		rpm --import http://packages.elastic.co/GPG-KEY-elasticsearch
		
	10.2 Create elastic-beats.repo under /etc/yum.repos.d in your sandbox for the Beats packages, and add the following to this file:
	
		[beats]
		name=Elastic Beats Repository
		baseurl=https://packages.elastic.co/beats/yum/el/$basearch
		enabled=1
		gpgkey=https://packages.elastic.co/GPG-KEY-elasticsearch
		gpgcheck=1
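
	When writing this file from a script, quote the heredoc delimiter so the shell does not expand the literal $basearch (yum substitutes it, not the shell). The sketch writes to /tmp so it is safe to try; copy the result to /etc/yum.repos.d/ as root:

```shell
# quoted 'EOF' keeps $basearch literal for yum to expand later
cat > /tmp/elastic-beats.repo <<'EOF'
[beats]
name=Elastic Beats Repository
baseurl=https://packages.elastic.co/beats/yum/el/$basearch
enabled=1
gpgkey=https://packages.elastic.co/GPG-KEY-elasticsearch
gpgcheck=1
EOF
grep -F 'el/$basearch' /tmp/elastic-beats.repo   # confirm the variable survived
```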
		
		
	10.3 Install filebeat with this command:

		yum -y install filebeat
		
		
	10.4 Configure Filebeat

		Edit the Filebeat configuration file /etc/filebeat/filebeat.yml.
		
		Under the output section, find the line that says elasticsearch:, which marks the start of the Elasticsearch output section 
		(which we are not going to use). Delete or comment out the entire Elasticsearch output section (up to the line that says logstash:).
		
		Then find the commented-out Logstash output section, indicated by the line that says #logstash:, and uncomment it by deleting the leading #. 
		In that section, also uncomment the hosts: ["localhost:5044"] line.
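
		The uncommenting can be scripted with sed. A hedged sketch, run here against a minimal stand-in for filebeat.yml rather than the real file (which has many more keys; inspect the result by hand before restarting Filebeat):

```shell
# minimal stand-in mimicking the commented-out logstash output section
cat > /tmp/filebeat.yml <<'EOF'
output:
  #logstash:
    #hosts: ["localhost:5044"]
EOF
# drop the leading '#' on the logstash: line and on its hosts: line (GNU sed)
sed -i 's/^\(\s*\)#logstash:/\1logstash:/' /tmp/filebeat.yml
sed -i 's/^\(\s*\)#hosts: \["localhost:5044"\]/\1hosts: ["localhost:5044"]/' /tmp/filebeat.yml
cat /tmp/filebeat.yml
```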
		
		
	10.5 Run the following commands to start, stop, or check the status of Filebeat:

		service filebeat start
		service filebeat stop
		service filebeat status 
		
11. Connect to Kibana

	http://localhost:8744/
	
	Create Index Pattern 'sparklabs'
	
	



			